February 4, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Recall that the goal of supervised machine learning is to learn some function that generally does a good job of mapping the input features to an outcome.
Generically, we assume that:
\[y = f(\mathbf x) + \epsilon\]
We would like to find an approximation to this function, \(\hat{f}(\mathbf x)\), that minimizes the generalization error:
\[\hat{f}(\mathbf x) = \underset{f^*(\mathbf x)}{\text{argmin }} E_\mathcal T[\mathcal L(y - f^*(\mathbf x))]\]
Unfortunately, we don’t get to see this quantity and need to approximate it.
Most commonly, we rely on the empirical risk minimization framework. Given training data:
\[E_\mathcal T[\mathcal L(y - f^*(\mathbf x))] \approx \expn \mathcal L(y_i - f^*(\mathbf x_i)) + \frac{\lambda}{N} C(f^*(\mathbf x))\]
where \(C()\) is a measure of the complexity of the approximating function.
We also know that generalization error can be decomposed into three components:
Bias: How far is the proposed function from the truth on average?
Irreducible Error: Given our feature set, how much variation in the outcome can we not explain?
Complexity: How complicated is the function, and how much capacity does it have to overfit to the training data in a way that misrepresents \(\mathcal T\)?
Only the complexity component is mitigated by the size of the training data!
Let’s think about our ability to uncover the true function as \(N\) gets very large:
The complexity penalty goes away with \(N\)
The irreducible error is there to stay
What we’re left with is bias
Recall that bias, in the machine learning context, is how far the proposed function is from the truth on average over the entire feature space.
Consider a very complicated classification problem in two dimensions.
Let’s try to fit an ERM minimizing logistic regression.
\[\mathcal L(\bb) = -\expn \left[ y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i) \right]\]
\[\theta_i = \sigma(z_i)\]
\[z_i = \mathbf x_i^T \boldsymbol \beta\]
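As a concrete illustration (the dataset, learning rate, and iteration count below are illustrative choices, not from the lecture), here is a minimal numpy sketch that fits this logistic regression by gradient descent on a two-dimensional problem whose positive class sits inside a circle:

```python
import numpy as np

# Hypothetical 2-D "curvy boundary" problem: the positive class lives
# inside a circle, so no linear decision boundary can separate it.
rng = np.random.default_rng(0)
N = 2000
X = rng.uniform(-1, 1, size=(N, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# ERM for logistic regression: minimize the cross-entropy loss by
# gradient descent on beta (with an intercept column).
Xb = np.column_stack([np.ones(N), X])
beta = np.zeros(3)
for _ in range(2000):
    theta = sigmoid(Xb @ beta)
    grad = Xb.T @ (theta - y) / N
    beta -= 0.5 * grad

acc = np.mean((sigmoid(Xb @ beta) > 0.5) == y)
print(f"training accuracy of the linear classifier: {acc:.2f}")
```

The linear classifier can do little better here than predicting the majority class.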
Unfortunately, there is no way for this linear classifier to uncover a curvy decision boundary!
Therefore, our generalization error can never be minimal
The problem is that a linear decision boundary isn’t expressive enough to capture the complexity of the problem!
How can we fix this?
One initial approach is to guess a formula for the decision boundary and alter the learning approach to capture that specific functional form
But the goal of machine learning is to learn from the data algorithmically
Instead, let’s restrict ourselves to learning methods that are universal approximators
Universal Approximator:
A learning method that can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error.
Deciphering the nerd speak:
A universal approximator can uncover any mapping of \(\mathbf x\) to \(y\) that is continuous on a bounded subset of \(\mathbb R^P\)
Even fewer syllables:
It can learn any reasonable function
IMPORTANT!
Just because it can, doesn’t mean that it will.
Reason 1:
The optimization method used to try to learn the function can’t actually find the function
A practical reason that a universal approximator won’t find the truth.
IMPORTANT!
Just because it can, doesn’t mean that it will.
Reason 2:
The training algorithm may choose the wrong function due to overfitting
All we get is the training data, so a universal approximator may not uncover the minimum generalization error function
Overfitting means mistaking irreducible error (noise) for signal.
Regardless, universal approximators are desirable in machine learning
For the kinds of data that we expect to see frequently in more complex problems (think image analysis), universal approximators are capable of uncovering any wiggly functional form
Doesn’t require injecting prior knowledge of the problem in order to achieve generalizable functions
How can we know that something is a universal approximator?
True in most cases:
The learning method could theoretically achieve 0 training error for any training data set
We know that this isn’t really good for generalization
However, if \(N = \infty\), then we would like to be able to learn the true function from the infinite i.i.d. samples from \(\mathcal T\)!
i.e., no bias with respect to the true functional form
In your previous classes, you’ve already discussed some examples of universal approximators
Which learning methods that you know of are universal approximators?
There’s a lot!
KNN
Explicit Polynomial Feature Transformations
Implicit Polynomial Expansions via Infinite Basis Kernels
Regression/Classification Trees
And neural networks with at least one hidden layer
Let’s look over some of these and think about:
How do they work as universal approximators?
How well do we expect that they will generalize with \(N < \infty\)?
Computational time
KNN can represent any function as \(N\) gets large!
Neat.
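A brute-force sketch of what that looks like in practice (the synthetic circular-boundary data and \(k = 5\) are illustrative assumptions):

```python
import numpy as np

# A brute-force k-NN classifier picks up a curvy decision boundary
# with no modeling effort at all.
rng = np.random.default_rng(1)
N = 3000
X = rng.uniform(-1, 1, size=(N, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)
Xtr, ytr, Xte, yte = X[:2000], y[:2000], X[2000:], y[2000:]

def knn_predict(Xtr, ytr, Xte, k=5):
    # Pairwise squared distances, then majority vote over the k nearest.
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (ytr[nearest].mean(axis=1) > 0.5).astype(int)

acc = np.mean(knn_predict(Xtr, ytr, Xte) == yte)
print(f"k-NN test accuracy: {acc:.2f}")
```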
Why don’t we just use KNN all the time?
Curse of Dimensionality:
For uniform coverage of the feature space, the number of training points (e.g. parameters) needed to fully represent it grows exponentially in the dimensionality.
With roughly 10 points per axis, the number of points (parameters) needed to effectively represent the space for a 1NN classifier is \(10^P\).
This is exponential blow-up in the necessary size of the parameter space.
This is the curse of dimensionality!
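A quick empirical sketch of the curse (the seed and sample size are arbitrary): with \(N\) fixed, the distance from a central query point to its nearest training neighbor grows rapidly with \(P\), so nearest neighbors stop being near.

```python
import numpy as np

# With N fixed, nearest neighbors drift away as the dimension P grows.
rng = np.random.default_rng(0)
N = 1000
dists = {}
for P in (1, 2, 10, 50):
    X = rng.uniform(0, 1, size=(N, P))  # N training points in [0, 1]^P
    q = np.full(P, 0.5)                 # query at the center of the cube
    dists[P] = np.sqrt(((X - q) ** 2).sum(axis=1)).min()
    print(f"P = {P:2d}: nearest training point is {dists[P]:.3f} away")
```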
The curse of dimensionality is a problem for learning:
When a learning method suffers from the curse of dimensionality, the number of implied parameters explodes relative to the size of training data
More parameters mean a higher complexity penalty and worse generalization performance. The complexity grows linearly in the number of estimated parameters, and the generalization error is roughly:
\[E_\mathcal T[\mathcal L(y - \hat{y})] \approx \expn \mathcal L(y_i - \hat{y}_i) + \mathcal O \left(\frac{P}{N} \right)\]
Because of this problem, KNN is not expected to generalize very well when the dimensionality of the feature space is large!
KNN also does not scale well in \(N\) when making a prediction.
Prediction Time: \(\mathcal O(K \log N)\)
As the training set gets larger, the time needed to make one prediction grows.
This is not great.
An aside:
Does the most basic linear model suffer from the curse of dimensionality?
\[\hat{y} = \mathbf x^T \boldsymbol \beta\]
Nope.
As \(P\) increases, the number of model parameters increases linearly
The cost:
Lack of expressiveness and high bias for complicated problems.
It turns out, though, that the linear model can be altered to be a universal approximator through clever choices of basis expansion.
One such choice is global polynomials
\[\phi(\mathbf x) = [1,\, x_1,\, x_2,\, x_1^2,\, x_2^2,\, x_1 x_2,\, x_1^2 x_2,\, x_1 x_2^2,\, \ldots]\]
Neat.
Why don’t we do this all the time?
A full polynomial expansion of degree \(d\) with \(P\) features has
\[\binom{P + d}{d}\]
features.
Using Stirling’s approximation, we can argue that this evaluates to roughly:
\[\frac{P^d}{d!}\]
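We can check both the exact count and the approximation directly (the degree \(d = 3\) is chosen purely for illustration):

```python
from math import comb, factorial

# Exact feature count C(P + d, d) for a full degree-d polynomial
# expansion, next to the P^d / d! approximation for fixed d.
d = 3
for P in (2, 10, 50, 100):
    exact = comb(P + d, d)
    approx = P ** d / factorial(d)
    print(f"P = {P:3d}: exact = {exact:8d}, P^d/d! ~ {approx:12.0f}")
```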
Now the number of features is blowing up exponentially in the degree of the polynomial!
And as \(P\) increases with \(d\) fixed, the count still grows like \(P^d\): polynomial rather than exponential, but ruinous for even moderate degrees.
This is neither going to generalize well nor be particularly computationally efficient!
Generalization Gap:
\[\mathcal O \left( \frac{P^d}{Nd!} \right)\]
Training time:
\[\mathcal O \left(\frac{NP^{2d}}{d!} \right)\]
Prediction Time:
\[\mathcal O(P)\]
The problem with explicit polynomial methods is the explosion in the feature set as \(P\) gets large.
Let’s look at a prediction made using this method for a continuous outcome using the OLS estimator:
\[y_0 = \mathbf x_0^T \hat{\boldsymbol \beta} = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y\]
Using a neat matrix identity (the push-through form of the Woodbury identity), we can show that this is equivalent to:
\[y_0 = \mathbf x_0^T \mathbf X^T (\mathbf X \mathbf X^T)^{-1} \mathbf y\]
Letting \(\boldsymbol \alpha = (\mathbf X \mathbf X^T)^{-1} \mathbf y\), we’re left with:
\[y_0 = \mathbf x_0^T \mathbf X^T \boldsymbol \alpha\]
Putting this in sum notation:
\[y_0 = \sum \limits_{i = 1}^N \mathbf x_0^T \mathbf x_i \alpha_i\]
Under a linear model, the prediction at any point is a weighted sum of the inner products between the new feature vector and each training feature vector.
This is important because of Mercer’s Theorem
Let \(\mathbf K\) be an \(N \times N\) positive definite kernel matrix:
\[\mathbf K = \begin{bmatrix} \mathcal K(\mathbf x_1, \mathbf x_1) & \ldots & \mathcal K(\mathbf x_1, \mathbf x_N) \\ \vdots & \ddots & \vdots \\ \mathcal K(\mathbf x_N, \mathbf x_1) & \ldots & \mathcal K(\mathbf x_N, \mathbf x_N) \\ \end{bmatrix} \]
where \(\mathcal K()\) is a symmetric function such that, for any real coefficients \(c_1, \ldots, c_N\):
\[\sum \limits_{i = 1}^N \sum \limits_{j = 1}^N \mathcal K(\mathbf x_i , \mathbf x_j) c_i c_j \ge 0\]
Mercer’s theorem states that any viable \(\mathcal K(\mathbf x_i , \mathbf x_j)\) can be expressed as:
\[k_{i,j} = \phi(\mathbf x_i)^T \phi(\mathbf x_j)\]
where \(\phi()\) is determined as a function of the eigenvectors of the kernel matrix.
What does this mean?
For certain basis expansions and certain methods (like linear regression), we can express a large basis expansion over \(P\) features, an \(N \times P'\) matrix, as an \(N \times N\) matrix instead, which is smaller whenever \(P' > N\)
We can encode some really complicated basis expansions of size \(P'\) in a length-\(N\) vector per observation.
This is useful because we can work with \(\mathbf K\) in place of \(\phi(\mathbf X)\)!
One common example is the polynomial kernel:
\[k_{ij} = (\mathbf x_i^T \mathbf x_j + c)^d\]
which corresponds to a feature map that is a scaled equivalent of the full polynomial expansion of degree \(d\)!
The most common kernel is the RBF kernel:
\[k_{ij} = \exp \left[- \frac{\| \mathbf x_i - \mathbf x_j\|^2}{2 \gamma^2} \right]\]
where \(\gamma\) is a scale (bandwidth) parameter.
The RBF kernel has an infinite-dimensional feature map that is essentially a polynomial expansion of infinite degree!
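To see where the infinite feature map comes from (a standard derivation): since \(\|\mathbf x_i - \mathbf x_j\|^2 = \|\mathbf x_i\|^2 + \|\mathbf x_j\|^2 - 2 \mathbf x_i^T \mathbf x_j\), the kernel factors as
\[k_{ij} = \exp \left[ -\frac{\|\mathbf x_i\|^2}{2\gamma^2} \right] \exp \left[ -\frac{\|\mathbf x_j\|^2}{2\gamma^2} \right] \exp \left[ \frac{\mathbf x_i^T \mathbf x_j}{\gamma^2} \right]\]
and Taylor expanding the last factor,
\[\exp \left[ \frac{\mathbf x_i^T \mathbf x_j}{\gamma^2} \right] = \sum \limits_{m = 0}^\infty \frac{(\mathbf x_i^T \mathbf x_j)^m}{\gamma^{2m} \, m!}\]
shows that polynomial features of every degree \(m\) appear in the implied expansion.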
Prior to the computational boom in the early 2000s, RBF kernel regressions/SVMs were the dominant approach to flexible machine learning!
Take any feature matrix and transform it into an \(N \times N\) kernel matrix using the RBF kernel
Train a linear model on the altered feature space
Choose \(\gamma\) that yields the best generalization performance
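The recipe above can be sketched with plain numpy kernel ridge regression (the one-dimensional dataset, the ridge penalty \(\lambda\), and the particular \(\gamma\) are illustrative assumptions; in practice both \(\gamma\) and \(\lambda\) are tuned):

```python
import numpy as np

# RBF kernel ridge regression: build K, solve for N dual coefficients,
# predict via kernel evaluations against the training points.
rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=N)

def rbf_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * gamma ** 2))

gamma, lam = 0.5, 1e-2
K = rbf_kernel(X, X, gamma)                      # N x N kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(N), y)  # N dual coefficients

Xgrid = np.linspace(-3, 3, 100)[:, None]
pred = rbf_kernel(Xgrid, X, gamma) @ alpha
mse = np.mean((pred - np.sin(2 * Xgrid[:, 0])) ** 2)
print(f"grid MSE against the true function: {mse:.4f}")
```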
This works pretty well because it doesn’t suffer from the curse of dimensionality
\(\mathbf K\) is always \(N \times N\)
Estimate \(N\) coefficients
So, the number of parameters increases linearly in the training set size!
Regression with the RBF kernel is a universal approximator!
Why don’t we use this all the time?
Two-fold problem:
Generalization
Computational Time
With \(N < \infty\), the RBF kernel can yield arbitrarily bad generalization performance
It’s too expressive for its own good
Tune \(\gamma\) and judge using \(K\)-fold CV
Computing and inverting an \(N \times N\) matrix (an \(\mathcal O(N^3)\) operation) can be really tough on your machine
RBF kernels are useful for modern machine learning methods, but do not scale particularly well!
We’ll see the RBF kernel combined with more scalable modern methods soon
Another regression style approach that yields universal approximations is the set of local approximators
For low-dimensional feature spaces, the most common method of doing this is regression splines
For a single feature, we can express this approximation as follows:
Divide the feature space into \(K\) distinct regions; call the points of division (the knots) \(\boldsymbol \xi\)
Within each region, use the training points to fit a local low-order polynomial approximation
With clever design, we can ensure continuity of our prediction function
These methods are great for low dimensional feature spaces, but don’t work well at all for \(P > 4\)
The exception is the step approximation
Extend the idea of a step to a rectangle in the feature space
Just make the prediction the average/majority class in that region
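A one-dimensional sketch of the step approximation (the bin count and synthetic data are illustrative): cut the feature into \(K\) bins and predict the average training outcome within each bin.

```python
import numpy as np

# Step approximation: piecewise-constant prediction via bin averages.
rng = np.random.default_rng(0)
N, K = 1000, 20
x = rng.uniform(0, 1, size=N)
y = np.sin(6 * x) + 0.2 * rng.normal(size=N)

edges = np.linspace(0, 1, K + 1)
bin_of = np.clip(np.digitize(x, edges) - 1, 0, K - 1)
bin_means = np.array([y[bin_of == k].mean() for k in range(K)])

def predict(x_new):
    b = np.clip(np.digitize(x_new, edges) - 1, 0, K - 1)
    return bin_means[b]

grid = np.linspace(0, 1, 200)
mse = np.mean((predict(grid) - np.sin(6 * grid)) ** 2)
print(f"grid MSE of the step approximation: {mse:.4f}")
```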
This is the basis of tree based algorithms
Start by making a single binary split on one feature that minimizes some loss criterion
Make a new split
Continue
Stop when we’ve reached full depth and each training point is in its own rectangle!
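A quick scikit-learn check of the full-depth claim (the dataset and seed are arbitrary): with no depth limit, every training point ends up in a pure leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A tree grown to full depth (no depth limit) places every training
# point in its own rectangle, so training error hits zero.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(f"training accuracy at full depth: {tree.score(X, y):.2f}")
```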
Referred to as trees, since the iterative splits can be organized in a tree-like structure.
Tree based methods are universal approximators if they are grown out to full depth
Single decision trees will approximate any function with enough data!
Single decision trees generalize really poorly
Curse of dimensionality:
The more features we have, the more training points we need to draw enough rectangles to effectively represent complicated spaces!
Overfitting everywhere!
Random Forests with Bagging present a really clever solution to this problem that doesn’t ruin the universal approximator properties of this algorithm
By limiting the number of candidate features considered at each split to a small number, each decision tree can grow to full depth while staying decorrelated from the others!
Then bagging allows us to average over the trees in a meaningful way to get a good approximation to the true function without overfitting.
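A sketch of the payoff on held-out data (the dataset, seed, and hyperparameters are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Single full-depth tree vs. a bagged forest with a restricted
# candidate feature set per split (max_features), on held-out data.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

tree_acc = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)
forest_acc = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                    random_state=0).fit(Xtr, ytr).score(Xte, yte)
print(f"single tree: {tree_acc:.2f}, random forest: {forest_acc:.2f}")
```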
Random forests are a really powerful algorithm for building predictive models for tabular data
There are attempts to use random forests/boosted trees for complex outputs like text and images
One of the reasons that we don’t use trees all the time for really big data is that they scale pretty poorly in \(N\)
To grow a single decision tree to full depth:
\[\mathcal O(P N \log N)\]
Random forests multiply this cost, though
Let \(B\) be the number of trees with feature set size \(P'\) that we grow to full depth and average using the bagging approach
\[\mathcal O(B P' N \log N)\]
We can distribute the trees over a cluster of CPUs to reduce time a little more
With enough computational power, not that bad
Random Forests also have new GPU implementations that speed up processing time
Random Forests (and boosted trees) work very well in certain settings
They are frequently shown to outperform neural network architectures on tabular data!
Where they struggle, though, is with really complex data types
Images
Speech
Text
Places where deep learning has really produced huge improvements in the quality of learning models!
This is an overview of modern methods that work well in a lot of settings
We have one more major universal approximator architecture to consider in neural networks